ci: add libvirt diagnostics and network activation for CoreOS TMT SSH failures#2157

Closed
Copilot wants to merge 2 commits into main from copilot/fix-coreos-tmt-ssh-provisioning
Copilot AI commented Apr 22, 2026

The test-coreos job fails intermittently with an opaque "SSH connectivity check failed after 60 attempts" error. No VM or network state is captured when this happens, which makes root cause analysis impossible.

Changes

crates/xtask/src/tmt.rs

  • Added dump_libvirt_diagnostics(sh, vm_name) called on SSH verification failure in run_tmt, emitting:
    • virsh list --all, virsh net-list --all, virsh net-dhcp-leases default
    • virsh dominfo, virsh domiflist, virsh dumpxml for the failing VM
    • All commands are best-effort (.ignore_status()) so diagnostics never block cleanup
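
The failure path described above can be sketched in shell. This is a minimal sketch under stated assumptions, not the Rust code in tmt.rs: `best_effort`, `wait_for_ssh`, and their arguments are hypothetical names, and the probe is whatever command the caller passes in. Only the virsh subcommands come from the list above.

```shell
# Hypothetical sketch; helper names are illustrative and not taken from tmt.rs.

# Run one diagnostic command under a header. Never fails the caller,
# even when the command errors out or is not installed.
best_effort() {
  echo "=== $* ==="
  "$@" || echo "(command failed or unavailable: $*)"
}

# Dump libvirt state for one VM using the virsh commands listed above.
dump_libvirt_diagnostics() {
  local vm_name="$1"
  best_effort virsh list --all
  best_effort virsh net-list --all
  best_effort virsh net-dhcp-leases default
  best_effort virsh dominfo "$vm_name"
  best_effort virsh domiflist "$vm_name"
  best_effort virsh dumpxml "$vm_name"
  return 0
}

# Retry a probe command up to max_attempts times. When every attempt fails,
# dump diagnostics to stderr and return non-zero so the caller can still
# run its own cleanup.
# Usage: wait_for_ssh VM_NAME MAX_ATTEMPTS DELAY_SECONDS PROBE_COMMAND...
wait_for_ssh() {
  local vm_name="$1" max_attempts="$2" delay="$3"
  shift 3
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    echo "Waiting for SSH... attempt ${attempt}/${max_attempts}"
    if "$@"; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "SSH connectivity check failed after ${max_attempts} attempts" >&2
  dump_libvirt_diagnostics "$vm_name" >&2
  return 1
}
```

A caller might invoke it as `wait_for_ssh "$vm" 60 5 ssh -o BatchMode=yes -o ConnectTimeout=5 "$user@$ip" true`, with `vm`, `user`, and `ip` supplied by the harness. Because the diagnostics function always returns 0, it cannot mask the original failure or prevent VM cleanup.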

.github/workflows/ci.yml, test-coreos job

  • Added a pre-test step to ensure the libvirt default network is active before launching the CoreOS VM; prints DHCP leases for confirmation:
- name: Ensure libvirt default network is active
  run: |
    virsh net-list --all
    if virsh net-info default >/dev/null 2>&1; then
      virsh net-start default || echo "Note: ..."
      virsh net-autostart default || echo "Note: ..."
    fi
    virsh net-list --all
    virsh net-dhcp-leases default || echo "Note: no DHCP leases yet"

The network activation step targets the most likely root cause (an inactive default network, and therefore no DHCP). The diagnostics dump ensures that any remaining failures produce actionable output instead of a bare timeout.
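
The activation logic could also check the network state before starting it, so that an already-active network is not treated as an error. The following is a sketch only: `network_is_active` and `ensure_network_active` are hypothetical helpers that are not part of this PR, the note messages are illustrative, and the parsing assumes the English `Active:` line that `virsh net-info` prints.

```shell
# Hypothetical helpers; not part of the workflow step in this PR.

# Return 0 if the named libvirt network reports "Active: yes".
# Assumes English (untranslated) `virsh net-info` output.
network_is_active() {
  virsh net-info "$1" 2>/dev/null |
    awk '$1 == "Active:" { found = ($2 == "yes") } END { exit !found }'
}

# Start the network only when it is defined and not yet active, then
# enable autostart. Always returns 0 so the CI step itself cannot fail.
ensure_network_active() {
  local net="$1"
  if ! virsh net-info "$net" >/dev/null 2>&1; then
    echo "Note: network '$net' is not defined; skipping activation"
    return 0
  fi
  if network_is_active "$net"; then
    echo "Network '$net' is already active"
  else
    virsh net-start "$net" || echo "Note: could not start network '$net'"
  fi
  virsh net-autostart "$net" || echo "Note: could not enable autostart for '$net'"
  return 0
}
```

Calling `ensure_network_active default` mirrors the workflow step: an undefined network is skipped, and failures are reported as notes rather than aborting the job.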

Original prompt

Fix flaky/failing CoreOS TMT SSH provisioning in CI (bootc-dev/bootc)

Context

  • CI workflow: .github/workflows/ci.yml contains a test-coreos job that builds an image and runs just test-tmt-on-coreos plan-bootc-install-on-coreos.
  • A recurring failure is:
    • Waiting for SSH... attempt 60/60
    • SSH verification failed for plan /tmt/plans/tests-install/plan-bootc-install-on-coreos: Verifying SSH connectivity: SSH connectivity check failed after 60 attempts
    • This results in: error: Recipe test-tmt-on-coreos failed ... exit code 1
  • The failure appears in workflow run 24793399494 (commit 2a459ac9b31eac38a294e3febec3811cbb3ccc6e, PR tests: Skip BLI detection test when discoverable-partitions config is absent #2155) and is not caused by the test change; it looks like libvirt/network/provisioning/guest boot regressions or missing diagnostics.

Goal

  • Make the CI failure actionable and/or eliminate the root cause by ensuring libvirt networking is up before starting the CoreOS VM and by capturing strong diagnostics when SSH never comes up.

Required changes

  1. Add robust debug capture when SSH connectivity verification fails in the CoreOS TMT harness.

    • Identify the script used by just test-tmt-on-coreos (likely hack/test-tmt-on-coreos or a similar wrapper).
    • Add an ERR trap or explicit failure handler that prints:
      • virsh list --all
      • virsh net-list --all
      • virsh net-dhcp-leases <network> (at least default)
      • virsh dominfo <domain> / virsh domiflist <domain> / virsh dumpxml <domain> for the relevant domain.
    • Ensure it still cleans up the VM after failure.
  2. Ensure libvirt network is active before starting CoreOS tests.

    • In .github/workflows/ci.yml test-coreos job, add a step before running the CoreOS test to check and start the libvirt network (default or the one used by the harness).
    • Print the network list and DHCP leases.
  3. Keep changes minimal and safe.

    • Do not weaken test coverage beyond what’s necessary.
    • Prefer making the environment deterministic and adding diagnostics.
  4. Update documentation/comments as needed.

Deliverables

  • Commit changes on a branch suitable for PR.
  • Update/adjust any related Justfile targets if needed.

References

  • Failing job log excerpt indicates failure in test-tmt-on-coreos recipe around line 211 and the failing TMT plan tmt/plans/tests-install/plan-bootc-install-on-coreos.
  • CI workflow file: .github/workflows/ci.yml.

User context

  • Current date: 2026-04-22
  • User login: jmarrero

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: Please find a solution for this failing job. Use the logs, job definition, and any referenced files where the failure occurred. Keep your response focused on the solution and include code suggestions when appropriate.

Assistant: The job is failing because the Fedora CoreOS VM never becomes reachable over SSH during the TMT plan tests-install/plan-bootc-install-on-coreos:

  • Log shows repeated: Waiting for SSH... attempt 60/60
  • Then: SSH verification failed ... SSH connectivity check failed after 60 attempts
  • This bubbles up as: error: Recipe test-tmt-on-coreos failed ... exit code 1

This is almost always one of:

  1. the VM didn’t boot (bad disk/image, kernel args, ignition, etc.),
  2. the VM booted but has no working network (no DHCP/forwarding on the libvirt network, wrong NIC model, etc.),
  3. SSH isn’t available/allowed (sshd not running, wrong user/key, ignition didn’t apply).
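
The three cases above can be turned into a rough triage order. The helper below is a hypothetical sketch, not part of the harness: its name, arguments, and output labels are illustrative, and the mapping is only a heuristic, since a domain that libvirt reports as running may still be stuck before it requests a DHCP lease.

```shell
# Hypothetical triage helper; name, arguments, and labels are illustrative.
#   $1  domain state as printed by `virsh domstate` (e.g. "running", "shut off")
#   $2  number of DHCP leases held by the guest on the libvirt network
#   $3  "open" if TCP port 22 on the guest accepts connections, else "closed"
classify_ssh_failure() {
  local dom_state="$1" lease_count="$2" port22="$3"
  if [ "$dom_state" != "running" ]; then
    # Case 1: the VM is not up at all.
    echo "vm-not-running: check disk image, kernel args, Ignition"
  elif [ "$lease_count" -eq 0 ]; then
    # Case 1 or 2: the guest stalled early, or the network serves no leases.
    echo "no-lease: guest did not boot far enough or DHCP is not working"
  elif [ "$port22" != "open" ]; then
    # Case 3: the guest has an address but sshd is not reachable.
    echo "sshd-unreachable: check that sshd is running and Ignition applied"
  else
    # Case 3: sshd answers, so the remaining suspects are user and key.
    echo "ssh-auth: check user, key, and Ignition-provisioned credentials"
  fi
}
```

The inputs map onto the diagnostics discussed here: the first comes from `virsh domstate`, the second from counting rows in `virsh net-dhcp-leases`, and the third from any TCP probe the harness already has.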

Targeted solution: make the CoreOS provision step fail earlier with actionable diagnostics, and ensure the libvirt network is usable

Right now, the harness times out without capturing why the guest isn’t reachable. Update the just test-tmt-on-coreos ... wrapper (the logs indicate it’s implemented by hack/test-tmt-on-coreos, failing around “line 211”) to dump libvirt/QEMU state when SSH doesn’t come up.

Code suggestion (add a failure trap + libvirt diagnostics)

In hack/test-tmt-on-coreos, wrap the run so that on failure you automatically print:

  • virsh list --all
  • virsh dominfo <domain>
  • virsh domiflist <domain>
  • virsh net-list --all
  • virsh net-dhcp-leases default (or the network you use)
  • virsh console capture if possible (or virsh dumpxml <domain> at minimum)

Example pattern:

#!/usr/bin/env bash
set -Eeuo pipefail  # -E (errtrace) lets the ERR trap below also fire inside functions and subshells

domain=""
dump_debug() {
  echo "=== libvirt debug ===" >&2
  virsh list --all || true
  virsh net-list --all || true
  if [[ -n "${domain}" ]]; then
    virsh dominfo "${domain}" || true
    virsh domiflist "${domain}" || true
    virsh dumpxml "${domain}" || true
  fi
  # On the default network, an empty lease table points to a DHCP problem
  virsh net-dhcp-leases default || true
}

trap dump_debug ERR

# ...existing code that sets/derives the libvirt domain name...
# domain="bootc-tmt-...."
# run tmt / bcvk provisioning that waits for ssh

This won’t “fix” networking by itself, but it turns the current opaque timeout into a...

This pull request was created from Copilot chat.

Add dump_libvirt_diagnostics() in tmt.rs to capture virsh state when
SSH connectivity verification fails (VM list, network list, DHCP leases,
dominfo, domiflist, dumpxml).

Add a CI step in the test-coreos job to ensure the libvirt default
network is active before running CoreOS tests.

Assisted-by: GitHub Copilot (claude-sonnet-4.5)

Agent-Logs-Url: https://github.com/bootc-dev/bootc/sessions/fc40be37-2aea-41f1-b021-4f5b1d804a4a

Co-authored-by: jmarrero <1894385+jmarrero@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix flaky CoreOS TMT SSH provisioning in CI" to "ci: add libvirt diagnostics and network activation for CoreOS TMT SSH failures" on Apr 22, 2026
Copilot AI requested a review from jmarrero April 22, 2026 19:12
jmarrero closed this on Apr 23, 2026